Homework 3: Mining Data Streams

By Group 16

In this project, we implement a streaming graph processing algorithm from the paper Stefani, L. D., Epasto, A., Riondato, M., & Upfal, E. (2017). Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(4), 1-50 (link: http://www.kdd.org/kdd2016/papers/files/rfp0465-de-stefaniA.pdf). First, we implement the reservoir sampling algorithm described in the paper. Then, we implement the streaming graph algorithms from the paper, Trièst-base and Trièst-impr, which make use of the reservoir sampling implemented in the previous step.
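The reservoir sampling step keeps a uniform random sample of at most M edges from a stream of unknown length. A minimal sketch of the standard scheme (function and variable names are ours, not from the paper's pseudocode):

```python
import random

def reservoir_sample(stream, M, seed=None):
    """Maintain a uniform sample of at most M items from a stream.

    After seeing t items, each item is in the sample with probability
    min(1, M/t), which is the property Trièst-base relies on.
    """
    rng = random.Random(seed)
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= M:
            sample.append(item)                # fill the reservoir first
        elif rng.random() < M / t:             # keep new item with prob. M/t
            sample[rng.randrange(M)] = item    # evict a uniformly chosen victim
    return sample

sample = reservoir_sample(range(1000), 5, seed=42)
print(sample)
```

In Trièst, each insertion/eviction is additionally paired with counter updates for the triangles the affected edge participates in.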

In order to ensure that our implementation is correct, we test our implementation with the following dataset:

General Relativity and Quantum Cosmology

The Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. If an author i co-authored a paper with author j, the graph contains an undirected edge between i and j. If a paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes. Link: https://snap.stanford.edu/data/ca-GrQc.html

The data covers papers in the period from January 1993 to April 2003 (124 months). It begins within a few months of the inception of the arXiv, and thus represents essentially the complete history of its GR-QC section.

Dataset statistics

| Statistic | Value |
| --- | --- |
| Nodes | 5242 |
| Edges | 14496 |
| Nodes in largest WCC | 4158 (0.793) |
| Edges in largest WCC | 13428 (0.926) |
| Nodes in largest SCC | 4158 (0.793) |
| Edges in largest SCC | 13428 (0.926) |
| Average clustering coefficient | 0.5296 |
| Number of triangles | 48260 |
| Fraction of closed triangles | 0.3619 |
| Diameter (longest shortest path) | 17 |
| 90-percentile effective diameter | 7.6 |

Triest-Base

The number of triangles estimated by Trièst-base is 48260.
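Trièst-base returns $\xi^{(t)} \tau^{(t)}$ as its estimate, where $\tau^{(t)}$ is the triangle counter maintained on the sample and $\xi^{(t)}$ is the scaling factor from the paper. A minimal sketch of the scaling factor (function name is ours):

```python
def xi(t, M):
    """Scaling factor xi(t) from Triest-base: the inverse of the
    probability that all three edges of a given triangle are in the
    size-M sample after t edges have been seen."""
    return max(1.0, (t * (t - 1) * (t - 2)) / (M * (M - 1) * (M - 2)))

# e.g. the global estimate at time t would be xi(t, M) * tau,
# where tau is the global triangle counter on the sample.
```

Note that while $t \leq M$ the sample holds the whole stream seen so far, so $\xi^{(t)} = 1$ and the estimate is exact.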

Triest-Impr

  1. UpdateCounters is called unconditionally for each element on the stream, before the algorithm decides whether or not to insert the edge into S.
  2. trièst-impr never decrements the counters when an edge is removed from S.
  3. UpdateCounters performs a weighted increase of the counters using $\eta^{(t)} = \max\{1, (t-1)(t-2)/(M(M-1))\}$ as weight.
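The weight in point 3 can be computed directly from $t$ and $M$; a minimal sketch (function name is ours):

```python
def eta(t, M):
    """Weight eta(t) used by Triest-impr's UpdateCounters: counters are
    increased by eta(t) instead of 1, compensating in expectation for
    triangles whose edges were never all in the sample."""
    return max(1.0, ((t - 1) * (t - 2)) / (M * (M - 1)))
```

While $t - 1 \leq M$ the weight is 1, so Trièst-impr behaves exactly like Trièst-base on the early part of the stream.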

The triangle count estimated by Trièst-impr is 48260.

Comparison of Triest-Base and Improved Version

Optional task for extra bonus

What were the challenges you have faced when implementing the algorithm?

In Trièst-base, many replacements occur in the sample at the beginning of the stream, which deletes several local counters. In addition, late edges are more likely to be ignored, so there may be no local counters left when the sample size is not relatively large. This can be resolved by increasing the sample size or by using Trièst-impr. In general, Trièst-base is sensitive to the dataset and favors edges that arrive close to the time the sample fills up.

Can the algorithm be easily parallelized? If yes, how? If not, why? Explain.

It is possible to parallelize the algorithm. This can be done by partitioning the edge stream into subsets. Another option is to parallelize the counters using Spark, since addition and subtraction are associative and commutative. The associative property states that you can regroup numbers and get the same answer, and the commutative property states that you can reorder numbers and still arrive at the same answer.
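This merge step can be illustrated without Spark: partial local counters computed on separate stream partitions (hypothetical values below) can be combined by plain addition, in any order and grouping, which is exactly the property a Spark reduce relies on:

```python
from collections import Counter
from functools import reduce

# Hypothetical partial local-triangle counters from three stream partitions.
part1 = Counter({"u": 3, "v": 1})
part2 = Counter({"v": 2, "w": 4})
part3 = Counter({"u": 1})

# Counter addition is associative and commutative, so the reduce
# can be applied in any order across workers.
merged = reduce(lambda a, b: a + b, [part1, part2, part3])
print(merged)
```

Note that a naive partitioning changes which triangles each worker can see, so in practice the partitioning scheme must ensure the edges of a triangle can meet (e.g. by sharding on node pairs).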

Does the algorithm work for unbounded graph streams? Explain.

The algorithm works at any time $t$, where $t \geq 0$. More specifically, "when queried at the end of time $t$, trièst-base returns $\xi^{(t)} \tau^{(t)}$ as the estimation for the global triangle count". So, for an unbounded graph stream, the algorithm can return an estimate of the global triangle count at any point, i.e., we estimate the total number of triangles seen so far in the stream.

Does the algorithm support edge deletions? If not, what modification would it need? Explain.

No, but the fully dynamic algorithm proposed in the same paper, called Trièst-FD, is able to: "It is based on random pairing, a sampling scheme that extends reservoir sampling and can handle deletions. The idea behind the RP scheme is that edge deletions seen on the stream will be compensated by future edge insertions".